Add OpenMP for FFTW #541
Add a heuristic that also works with PkgConfig to query OpenMP support in FFTW. Enable it by default if we build with the OpenMP compute backend, unless explicitly disabled. Add a macro to control the source code, since FFTW does not offer a public define for this.
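A minimal sketch of how such a build-system macro could gate the threading calls in the source. `HIPACE_FFTW_OMP` is the definition set by this PR; the helper `init_fft_threading` and the flag `fft_has_omp` are hypothetical names used only for illustration, and only the double-precision FFTW calls are shown.

```cpp
// Sketch only: gate FFTW threading on a macro provided by the build system.
#ifdef HIPACE_FFTW_OMP
#  include <fftw3.h>
constexpr bool fft_has_omp = true;
#else
constexpr bool fft_has_omp = false;
#endif

// Hypothetical helper: returns the number of threads FFTW plans will use.
inline int init_fft_threading (int requested_threads)
{
#ifdef HIPACE_FFTW_OMP
    // fftw_init_threads() returns zero on failure.
    if (requested_threads > 1 && fftw_init_threads() != 0) {
        fftw_plan_with_nthreads(requested_threads);
        return requested_threads;
    }
    return 1;
#else
    (void) requested_threads;  // threading support was not compiled in
    return 1;
#endif
}
```

Without the definition, the helper compiles to a serial fallback, so the rest of the code does not need to know whether FFTW was found with OpenMP support.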
Just documenting here: for best results, use close-by pinning, especially with MPI, and avoid oversubscription of cores, especially if no hyperthreading is available:
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export OMP_NUM_THREADS=1 # 1,2,4,...
@SeverinDiederichs I just realized I did not try single precision. Did that work for you as well?
fftwf_plan_with_nthreads(omp_get_max_threads());
# else
fftw_init_threads();
fftw_plan_with_nthreads(omp_get_max_threads());
We could also add, just to expose even more control, a runtime parameter in the inputs file that can override the value passed to ..._nthreads(). The default would be the heuristic you already added (1 if <32**2 cells and omp_get_max_threads() otherwise), but it could add a useful intermediate layer of control in case we want to set the FFT parallelism independently of the rest of the simulation, which is controlled by OMP_NUM_THREADS.
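The suggested override layer could look roughly like the sketch below. The function name `resolve_fft_threads` and the way the user value is passed in are assumptions for illustration; how the value would actually be read from the inputs file is not specified in this thread.

```cpp
#include <optional>

// Hypothetical resolver: an explicit value from the inputs file wins,
// otherwise fall back to the default chosen by the heuristic.
constexpr int resolve_fft_threads (std::optional<int> user_value,
                                   int heuristic_default)
{
    // Ignore missing or non-positive user values.
    if (user_value.has_value() && *user_value >= 1) {
        return *user_value;
    }
    return heuristic_default;
}
```

The result would then be the value passed to ..._nthreads(), leaving OMP_NUM_THREADS in charge of everything else.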
You are right, this will be an interesting addition. After an offline discussion with @MaxThevenet, I will merge this PR as is and add this feature as soon as we have other OpenMP acceleration. As this is the only function using OpenMP, we currently have full control with OMP_NUM_THREADS.
Oh right, I forgot this is the first OpenMP accelerated part 😅
Looks great, thanks for this PR!
message(STATUS "FFTW: Found OpenMP support")
target_compile_definitions(HiPACE::thirdparty::FFT INTERFACE HIPACE_FFTW_OMP=1)
else()
message(STATUS "FFTW: Could NOT find OpenMP support")
lovely!
I also tested single precision; it works well and shows the same behaviour 👍
Thanks to @ax3l, FFTW with OpenMP now works.
On my laptop, using the beam-in-vacuum example without I/O and without a beam at amr.n_cell = 1024 1024 50, I get the following run times:
This implementation seems to give the correct results. Using 2 threads, all tests pass locally:
For comparison, here are the run times on development on my laptop:
Note that the next_deposition_beam.2Rank test takes much longer than when running in serial. The reason is its extremely low resolution of 16 transverse grid points: it does not make sense to use more than 1 CPU for the FFT there. Using 32 grid points is still slightly slower with 2 threads. Increasing the number of grid points to 64 yields a speedup with more threads again. Therefore, a threshold of nx > 32 && ny > 32 was added. If it is not met, the FFT is executed with a single thread.
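The threshold described above can be sketched as a small helper. The function name `fft_threads_for_grid` is hypothetical; in the PR's code the maximum would come from omp_get_max_threads().

```cpp
// Sketch of the grid-size threshold: small transverse grids run the FFT
// on a single thread, larger ones use all available threads.
constexpr int fft_threads_for_grid (int nx, int ny, int max_threads)
{
    return (nx > 32 && ny > 32) ? max_threads : 1;
}
```

With this rule, a 16 x 16 or 32 x 32 grid stays single-threaded, while a 64 x 64 grid uses the full thread count.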